Synthesizing Data Programs

Author

  • Michael J. Cafarella
Abstract

At least two important tasks in modern data management exist outside traditional database models and query languages: data transformation and feature programs. Data transformation is the informal preprocessing code that turns raw data into a dataset suitable for import into a relational database for deeper analysis. Feature programs transform raw data into a compact piece of training data suitable for use in a statistical training procedure. These two tasks are driven by two applications that are intellectually exciting and economically important, and that rightfully garner substantial attention in the database community: data analytics and machine learning. Unfortunately, to date, both data transformation and feature programs have largely existed in an ad hoc netherworld of Python programs and shell scripts.

Consider a stock trader engaged in an analytics task, who is interested in downloading a text file that describes large executive stock sales. The trader's intention is to identify "suspicious stocks" from the executiveSale table described by the text file, then perform a semijoin between the query answer and her preexisting stockHoldings table; the result will identify stocks that she wants to consider selling. But before she can do so, she must do some critical data transformation: she must write a regular expression to capture triples of stockSymbol, executiveName, sharesSold; then multiply the sharesSold number by the share price; then translate numeric values to the correct binary formats; then finally emit the results in a format that can be imported by her RDBMS.

Current practice is for the trader to write the above steps in a small but monolithic program in a mixture of Python, XPath, regular expressions, or perhaps some similar languages. It is perhaps not surprising that engineers use a range of tools here: a large part of the problem is that the input data can come described in almost any model, including informal ones.
However justified, the resulting programs remain burdensome to write, easy to write incorrectly, hard to maintain in the face of changes to either the input text or the output schema, and opaque to the database optimizer.
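
To make the trader's transformation concrete, the steps above can be sketched as the kind of small, monolithic script the abstract describes. This is only an illustrative sketch, not the paper's method: the input line layout (`AAPL | Jane Doe | 12,500 shares sold`), the `share_prices` lookup, the `saleValue` column name, and the choice of CSV as the RDBMS import format are all assumptions made for the example.

```python
import csv
import io
import re
from decimal import Decimal

# Hypothetical input layout, one sale per line:
#   AAPL | Jane Doe | 12,500 shares sold
SALE_PATTERN = re.compile(
    r"^(?P<stockSymbol>[A-Z]+)\s*\|\s*"
    r"(?P<executiveName>[^|]+?)\s*\|\s*"
    r"(?P<sharesSold>[\d,]+)\s+shares sold$"
)


def transform(raw_text, share_prices):
    """Yield (stockSymbol, executiveName, sharesSold, saleValue) tuples.

    share_prices is an assumed dict mapping symbol -> Decimal price.
    """
    for line in raw_text.splitlines():
        match = SALE_PATTERN.match(line.strip())
        if match is None:
            # Silently skipping unmatched lines is one of the ways such
            # scripts end up "easy to write incorrectly".
            continue
        symbol = match["stockSymbol"]
        # Convert the textual number ("12,500") to a real integer.
        shares = int(match["sharesSold"].replace(",", ""))
        # Raises KeyError if no price is known for the symbol.
        sale_value = shares * share_prices[symbol]
        yield symbol, match["executiveName"], shares, sale_value


def to_csv(rows):
    """Serialize rows as CSV text, an assumed RDBMS import format."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["stockSymbol", "executiveName", "sharesSold", "saleValue"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "AAPL | Jane Doe | 12,500 shares sold\nnot a sale line\n"
    prices = {"AAPL": Decimal("150.25")}
    print(to_csv(transform(sample, prices)), end="")
```

Even in this short form, the regular expression, the numeric conversion, and the output schema are tangled together in one place, so a change to either the input text or the target table means touching the whole script.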

Similar resources

SYNTHESIZING EFFICIENT OUT-OF-CORE PROGRAMS FOR BLOCK RECURSIVE ALGORITHMS USING BLOCK-CYCLIC DATA DISTRIBUTIONS

This paper presents a framework for synthesizing I/O-efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform and matrix transpositions. The programs are synthesized from tensor (Kronecker) product representations of algorithms. These programs are optimized for a striped two-level memory model wherein the out-of-core data can have block-cyclic distribu...

Synthesizing Reactive Programs

Current theoretical solutions to the classical Church’s synthesis problem are focused on synthesizing transition systems and not programs. Programs are compact and often the true aim in many synthesis problems, while the transition systems that correspond to them are often large and not very useful as synthesized artefacts. Consequently, current practical techniques first synthesize a transitio...

Synthesizing Efficient Out-of-Core Programs for Block Recursive Algorithms Using Block-Cyclic Data Distributions

In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory mo...

Synthesizing Efficient Out-of-core Programs for Block Recursive Algorithms Using Block-cyclic Data Distributions

In this paper, we present a framework for synthesizing I/O efficient out-of-core programs for block recursive algorithms, such as the fast Fourier transform (FFT) and block matrix transposition algorithms. Our framework uses an algebraic representation which is based on tensor products and other matrix operations. The programs are optimized for the striped Vitter and Shriver's two-level memory mo...

Automatic Programming for Streams

Most automatic programming research has focused on programs which terminate and which produce output values upon termination. By contrast, programs which operate on streams of data usually do not terminate and usually produce streams of output data during execution. Such stream programs may be specified with a technique which is a generalization of specification techniques for conventional prog...

Publication date: 2015